Abstract

As language data and associated 
technologies proliferate and as the 
language resources community rapidly 
expands, it has become difficult to 
locate and reuse existing resources. 
Are there any lexical resources for 
such-and-such a language? What 
tool can work with transcripts in 
this particular format? What is a 
good format to use for linguistic data 
of this type? Questions like these 
dominate many mailing lists, since 
web search engines are an unreliable 
way to find language resources. 
This paper describes a new digital 
infrastructure for language resource 
discovery, based on the Open Archives 
Initiative, and called OLAC - the 
Open Language Archives Community. 
The OLAC Metadata Set and the 
associated controlled vocabularies 
facilitate consistent description and 
focussed searching. We report progress 
on the metadata set and controlled 
vocabularies, describing current issues 
and soliciting input from the language 
resources community. 

1 Introduction 

Language technology and the linguistic sciences 
are confronted with a vast array of language 
resources, richly structured, large and diverse. 
Multiple communities depend on language
resources, including linguists, engineers,
teachers and actual speakers. Many individuals 
and institutions provide key pieces of the 
infrastructure, including archivists, software 
developers, and publishers. Today we have 
unprecedented opportunities to connect these 
communities to the language resources they 
need. First, inexpensive mass storage technology 
permits large resources to be stored in digital 
form, while the Extensible Markup Language 
(XML) and Unicode provide flexible ways 
to represent structured data and ensure its 
long-term survival. Second, digital publication 
- both on and off the world wide web - is the 
most practical and efficient means of sharing 
language resources. Finally, a standard resource 
description model, the Dublin Core Metadata Set, 
together with an interchange method provided 
by the Open Archives Initiative (OAI), make 
it possible to construct a union catalog over 
multiple repositories and archives. 

In December 2000, an NSF-funded workshop 
on Web-Based Language Documentation and 
Description, held in Philadelphia, brought 
together a group of nearly 100 language 
software developers, linguists, and archivists 
who are responsible for creating language 
resources in North America, South America, 
Europe, Africa, the Middle East, Asia and 
Australia [www.ldc.upenn.edu/exploration/expl2000/].
The outcome
of the workshop was the founding of the Open 
Language Archives Community (OLAC), an 
application of the OAI to digital archives of 
language resources, with the following purpose: 



OLAC, the Open Language Archives 
Community, is an international 
partnership of institutions and 
individuals who are creating a 
worldwide virtual library of language 
resources by: (i) developing consensus 
on best current practice for the digital 
archiving of language resources, 
and (ii) developing a network of 
interoperating repositories and services 
for housing and accessing such 
resources. 

This paper will describe the leading ideas 
that motivate OLAC, before focussing on 
the metadata set and the controlled vocabularies 
which implement part (ii) of OLAC's statement of 
purpose. Metadata elements of special interest to 
the language resources community include such 
things as language identification and language 
resource type. The corresponding controlled 
vocabularies ensure consistent description. 
For example, French language resources are 
specified using an official RFC-3066 designation 
(Alvestrand, 2001), instead of multiple distinct
text strings like "French", "Français" and
"Francais". A separate controlled vocabulary
exists for resource type, and has items such 
as annotation/phonetic and description/grammar. 
Services for end-users can map controlled 
vocabularies onto convenient terminology for 
any target language. (A live demonstration 
accompanies this presentation.) 
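
The role of a controlled vocabulary in description and display can be illustrated with a minimal Python sketch. The variant table and display names below are hypothetical examples and are not OLAC data; only the RFC 3066 tag "fr" for French is taken from the standard.

```python
# Illustrative sketch: free-text language names collapse to one coded value,
# and the code is mapped back to a display name in the reader's language.
# Both tables are hypothetical examples, not part of any OLAC vocabulary.
import unicodedata
from typing import Optional

VARIANT_TO_CODE = {"french": "fr", "francais": "fr"}

DISPLAY_NAMES = {
    "en": {"fr": "French"},
    "fr": {"fr": "français"},
}


def fold(name: str) -> str:
    """Lower-case a name and strip diacritics, so 'Français' matches 'Francais'."""
    decomposed = unicodedata.normalize("NFD", name.strip().lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_code(name: str) -> Optional[str]:
    """Return the coded value for a free-text name, or None if unknown."""
    return VARIANT_TO_CODE.get(fold(name))


def display(code: str, ui_lang: str) -> str:
    """Return a display name for the code, falling back to the code itself."""
    return DISPLAY_NAMES.get(ui_lang, {}).get(code, code)
```

A catalog that stores only the code can then answer a query for any of the three spellings, while a service provider chooses the display name per reader.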

2 Locating Data, Tools and Advice 

We can observe that the individuals who use 
and create language resources are looking 
for three things: data, tools, and advice. By 
DATA we mean any information that documents 
or describes a language, such as a published 
monograph, a computer data file, or even 
a shoebox full of hand-written index cards. 
The information could range in content from 
unanalyzed sound recordings to fully transcribed 
and annotated texts to a complete descriptive 
grammar. By TOOLS we mean computational 
resources that facilitate creating, viewing, 
querying, or otherwise using language data. 
Tools include not just software programs, but also 
Figure 1: In reality the user can't always get there
from here 



the digital resources that the programs depend 
on, such as fonts, stylesheets, and document 
type definitions. By ADVICE we mean any 
information about what data sources are reliable, 
what tools are appropriate in a given situation, 
what practices to follow when creating new data, 
and so forth. In the context of OLAC, the term 
language resource is broadly construed to include 
all three of these: data, tools and advice. 

Unfortunately, today's user does not have ready 
access to the resources that are needed. Figure 1
offers a diagrammatic view of the reality. Some 
archives (e.g. Archive 1) do have a site on the 
internet which the user is able to find, so the 
resources of that archive are accessible. Other 
archives (e.g. Archive 2) are on the internet, so 
the user could access them in theory, but the user 
has no idea they exist so they are not accessible in 
practice. Still other archives (e.g. Archive 3) are 
not even on the internet. And there are potentially 
hundreds of archives (e.g. Archive n) that the user 
needs to know about. Tools and advice are out 
there as well, but are at many different sites. 

There are many other problems inherent in the 
current situation. For instance, the user may not 
be able to find all the existing data about the 
language of interest because different sites have 
called it by different names (low recall). The 
user may be swamped with irrelevant resources 
because search terms have important meanings in 
other domains (low precision). The user may not 
be able to use an accessible data file for lack of 
being able to match it with the right tools. The 
user may locate advice that seems relevant but 
have no basis for judging its merits. 

2.1 Bridging the gap 



2.1.1 Why improved web-indexing is not 
enough 

As the internet grows and web-indexing
technologies improve, one might hope that
a general-purpose search engine would be
sufficient to bridge the gap between people 
and the resources they need, but this is a vain 
hope. The first reason is that many language 
resources, such as audio files and software, are 
not text-based. The second reason concerns 
language identification, the single most important 
property for describing language resources. 
If a language has a canonical name which is 
distinctive as a character string, then the user 
has a chance of finding any online resources 
with a search engine. However, the language 
may have multiple names, possibly due to the 
vagaries of Romanization, such as a language 
known variously as Fadicca, Fadicha, Fedija, 
Fadija, Fiadidja, Fiyadikkya, and Fedicca (giving 
low recall). The language name may collide with 
a word which has other interpretations that are 
vastly more frequent, e.g. the language names 
Mango and Santa Cruz (giving low precision). 

The third reason why general-purpose search 
engines are inadequate is the simple fact that 
much of the material is not, and will not be,
documented in free prose on the web. Either 
people will build systematic catalogues of their 
resources, or they won't do it at all. Of course, 
one can always export a back-end database 
as HTML and let the search engines index 
the materials. Indeed, encouraging people to 
document resources and make them accessible 
to search engines is part of our vision. However, 
despite the power of web search engines, there 
remain many instances where people still prefer 
to use more formal databases to house their data. 

This last point bears further consideration. The 
challenge is to build a system for "bringing like 
things together and differentiating among them" 
(Svenonius, 2000). There are two dominant
storage and indexing paradigms, one exemplified
by traditional databases and one exemplified 
by the web. In the case of language resources, 
the metadata is coherent enough to be stored in 
a formal database, but sufficiently distributed 
and dynamic that it is impractical to maintain it 
centrally. Language resources occupy the middle
ground between the two paradigms, neither of
which will serve adequately. A new framework 
is required that permits the best of both worlds, 
namely bottom-up, distributed initiatives, along 
with consistent, centralized finding aids. The 
Dublin Core (DC) and the Open Archives 
Initiative provide the framework we need to 
"bridge the gap." 

2.1.2 The Dublin Core Metadata Initiative 

The Dublin Core Metadata Initiative began in 
1995 to develop conventions for resource discovery
on the web [dublincore.org]. The Dublin
Core metadata elements represent a broad, inter- 
disciplinary consensus about the core set of ele- 
ments that are likely to be widely useful to sup- 
port resource discovery. The Dublin Core consists 
of 15 metadata elements, where each element 
is optional and repeatable: Title, Creator, Subject, 
Description, Publisher, Contributor, Date, Type, For- 
mat, Identifier, Source, Language, Relation, Coverage, 
Rights. This set can be used to describe resources 
that exist in digital or traditional formats. 



In "Dublin Core Qualifiers" (DCMI, 2000a),
two kinds of qualifications are allowed: encoding 
schemes and refinements. An encoding scheme 
specifies a particular controlled vocabulary or 
notation for expressing the value of an element. 
The encoding scheme serves to aid a client system 
in interpreting the exact meaning of the element 
content. A refinement makes the meaning of the 
element more specific. For example, a Language 
element can be encoded using the conventions 
of RFC 3066 to unambiguously identify the 
language in which the resource is written (or 
spoken). A Subject element can be given a 
language refinement to restrict its interpretation 
to concern the language the resource is about. 

2.1.3 The Open Archives Initiative 

The Open Archives Initiative (OAI) was 
launched in October 1999 to provide a common 
framework across electronic preprint archives, 
and it has since been broadened to include digital 
repositories of scholarly materials regardless of 
their type [www.openarchives.org] (Lagoze and
de Sompel, 2001).

In the OAI infrastructure, each participating 
archive implements a repository - a network 
accessible server offering public access to 
Figure 2: Bridging the gap through community
infrastructure 

archive holdings. The primary object in an 
OAI-conformant repository is called an item, 
having a unique identifier and being associated 
with one or more metadata records. Each 
metadata record describes an archive holding, 
which is any kind of primary resource such as 
a document, raw data, software, a recording, 
a physical artifact, a digital surrogate, and so 
forth. Each metadata record will usually contain 
a reference to an entry point for the holding, such 
as a URL or a physical location, as shown in 
Figure 2.

To implement the OAI infrastructure, a 
participating archive must comply with two 
standards: the OAI shared metadata set (Dublin 
Core), which facilitates interoperability across 
all repositories participating in the OAI, and the 
OAI metadata harvesting protocol, which allows 
software services to query a repository using 
HTTP requests. 
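
As a rough illustration of what such an HTTP request looks like, the sketch below builds a harvesting URL in Python. The `verb` and `metadataPrefix` parameter names follow the published OAI-PMH specification (version 2.0), which may differ in detail from the protocol version current when this paper was written; the repository address is a hypothetical placeholder.

```python
# Minimal sketch: construct (but do not send) a harvesting request URL.
# The base URL is a hypothetical placeholder, not a real repository.
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a request asking a repository for all records in one format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


url = list_records_url("http://archive.example.org/oai")
```

A service provider would issue one such request per registered data provider and merge the XML responses into its union catalog.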

OAI archives are called "data providers," 
though they are strictly just metadata providers. 
Typically, data providers will also have a 
submission procedure, together with a long-term 
storage system, and a mechanism permitting 
users to obtain materials from the archive. An 
OAI "service provider" is a third party that 
provides end-user services (such as search 
functions over union catalogs) based on metadata 
harvested from one or more OAI data providers. 
Figure 3 illustrates a single service provider
accessing three data providers (using the OAI 
metadata harvesting protocol). End-users only 
interact with service providers. 

Over the past decade, the Linguist List has 
become the primary source of online informa- 
tion for the linguistics community, reaching out 
Figure 3: A Service Provider Accessing Multiple
Data Providers 



to over 13,000 subscribers worldwide, and having 
four complete mirror sites. The Linguist List will 
be augmenting its service by hosting the primary 
service provider for OLAC, and permitting end- 
users to browse distributed language resources at 
a single place. 

2.2 Applying the OAI to language resources 

The OAI infrastructure is a new invention; it has 
the bottom-up, distributed character of the web, 
while simultaneously having the efficient, struc- 
tured nature of a centralized database. This com- 
bination is well-suited to the language resource 
community, where the available data is growing 
rapidly and where a large user-base is fairly con- 
sistent in how it describes its resource needs. 

The primary outcome of the Philadelphia work- 
shop was the founding of the Open Language 
Archives Community, and with it the identifica- 
tion of an advisory board, alpha testers and mem- 
ber archives. Details of these groups are available 
from the OLAC site [www.language-archives.org].

Recall that the OAI community is defined by 
the archives which comply with the OAI metadata 
harvesting protocol and that register with the OAI. 
Any compliant repository can register as an Open 
Archive, and the metadata provided by an Open 
Archive is open to the public. OAI data providers 
may support metadata standards in addition to 
the Dublin Core. Thus, a specialist community 
can define a metadata format which is specific 
to its domain. Service providers, data providers 
and users that employ this specialized metadata 
format constitute an OAI subcommunity. The
workshop participants agreed unanimously that
the OAI provides a significant piece of the
infrastructure needed for the language resources
community.

In the same way that OLAC represents a 
specialized subcommunity with respect to the 
entire Open Archives community, there are 
specialized subcommunities within the scope 
of OLAC. For instance, the ISLE Meta Data 
Initiative is developing a detailed metadata 
scheme for corpora of recorded speech events 
and their associated descriptions (MPI ISLE
Team, 2000). Similarly, the language data centers
- the Linguistic Data Consortium (LDC) and 
the European Language Resources Association 
(ELRA) - are using OLAC metadata as the 
basis of a joint catalog, and will add elements 
and vocabularies for their specialized needs 
(price, rights, and categories of membership 
and use). For archived language resources that 
are of this kind, such a metadata scheme would 
support a richer description. This specialized 
subcommunity can implement its own service 
provider that offers focused searching based on 
its own rich metadata set. At the same time, the 
data providers will be exposing OLAC and Dublin
Core versions of the metadata, permitting the 
resources to be discovered by users of OLAC and 
OAI service providers. 

2.3 Federation and integration of language 
resource archives 

The OAI framework permits archives to 
interoperate. OAI archives support the Dublin 
Core metadata format and metadata harvesting 
protocol. OLAC archives additionally support the 
OLAC metadata format. Widespread adoption 
of these standards will permit language resource 
archives to be federated and integrated. 

First, a collection of archives which support 
the same metadata format can be federated, in 
the sense that a virtual meta-archive can collect 
all the information into a single place, and end- 
users can query multiple archives simultaneously. 
To demonstrate this, the Linguistic Data Consor- 
tium has harvested the catalogs of three language 
resource archives (LDC, ELRA, DFKI) and cre- 
ated a prototype service provider. A search for 
language=Bulgarian returns records from all three
archives, as shown in Figure 4 (Banik and Bird,
2001).
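
Federation in this sense amounts to a single query over the pooled records. The following Python sketch uses three records abridged from Figure 4; the in-memory list and the function name are illustrative assumptions and do not describe the prototype's actual implementation.

```python
# Sketch of a federated query over harvested metadata.
# Records are abridged from Figure 4; language lists are shortened.
HARVESTED = [
    {"id": "oai:ldc:LDC94T5", "Title": "ECI Multilingual Text",
     "Subject.language": ["Albanian", "Bulgarian", "Chinese"]},
    {"id": "oai:elra:L0030", "Title": "Bulgarian Morphological Dictionary",
     "Subject.language": ["Bulgarian"]},
    {"id": "oai:dfki:KPML", "Title": "KPML",
     "Subject.language": ["Spanish", "Russian", "Bulgarian"]},
]


def search_by_language(records, language):
    """Return identifiers of all records whose subject language matches."""
    return [r["id"] for r in records if language in r["Subject.language"]]


def archive_of(identifier):
    """Extract the archive name from an identifier such as 'oai:ldc:LDC94T5'."""
    return identifier.split(":")[1]
```

Because every record carries its archive in the identifier, the end-user sees one result list while still being able to tell which archive holds each resource.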

Second, a collection of archives which support 
the same metadata format can be integrated, in the 
sense that relational joins can be performed across 
different archives. This permits queries such as: 
"find all lexicon tools that understand a format for 
which Hungarian data is available." 
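
The example query can be read as a join between data records and tool records on a shared format value. A minimal Python sketch follows; all records, identifiers and format names in it are hypothetical and serve only to show the shape of the join.

```python
# Sketch of an integrated query: join data and tools on Format.markup.
# Every record below is a made-up example, not a real archive holding.
DATA = [
    {"id": "oai:example:hu-lexicon", "Subject.language": "Hungarian",
     "Format.markup": "markup-A"},
    {"id": "oai:example:fr-corpus", "Subject.language": "French",
     "Format.markup": "markup-B"},
]

TOOLS = [
    {"id": "oai:example:lexedit", "Type.functionality": "lexicon",
     "Format.markup": "markup-A"},
    {"id": "oai:example:concord", "Type.functionality": "concordance",
     "Format.markup": "markup-A"},
    {"id": "oai:example:lexview", "Type.functionality": "lexicon",
     "Format.markup": "markup-B"},
]


def tools_for(language, functionality):
    """Find tools of one kind that read a format in which data for the language exists."""
    formats = {d["Format.markup"] for d in DATA
               if d["Subject.language"] == language}
    return sorted(t["id"] for t in TOOLS
                  if t["Type.functionality"] == functionality
                  and t["Format.markup"] in formats)
```

The join only works if both sides draw the format value from the same controlled vocabulary, which is one motivation for the vocabularies described later.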

3 A Core Metadata Set for Language 
Resources 

The OLAC Metadata Set extends the Dublin 
Core set only to the minimum degree required 
to express basic properties of language resources 
which are useful as finding aids. 

All fifteen Dublin Core elements are used in the 
OLAC Metadata Set. In order to suit the specific 
needs of the language resources community, the 
elements have been qualified following principles 
articulated in "Dublin Core Qualifiers" (DCMI,
2000a) and exemplified in (DCMI, 2000b).

This section describes some of the attributes, 
elements and controlled vocabularies of the 
OLAC Metadata Set. Before launching into 
this discussion, we first review some XML 
terminology and explain some aspects of the 
OLAC representation which follow directly from 
our choice of XML. 

3.1 Aside: XML representation 

The Extensible Markup Language (XML) is the 
universal format for structured documents and 
data on the Web [www.w3.org/XML]. The key
building block of an XML document is the element.
An element has a name, attributes and content.
Here is an example of an element Language
with attributes refine and code, and free-text
content:

<Language refine="OLAC" code="x-sil-BAN">Foreke Dschang</Language>
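
For readers who want to see the three parts separated, the element can be parsed with Python's standard library. This is only an illustrative sketch; the attribute names are spelled as in the prose above (refine, code).

```python
# Sketch: split the example element into name, attributes and content.
import xml.etree.ElementTree as ET

xml_text = '<Language refine="OLAC" code="x-sil-BAN">Foreke Dschang</Language>'
element = ET.fromstring(xml_text)

name = element.tag                 # element name
attributes = dict(element.attrib)  # attribute name -> value
content = element.text             # free-text content
```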

In general, XML elements may contain other 
elements, or they may be empty. XML Docu- 
ment Type Definitions (DTDs) and XML schemas 
are grammars that define the structure of a valid 
XML document, and they limit the arrangement 
of XML elements in a document. We believe it 
is important to use a formal mechanism for validating
a metadata record. Following the OAI, we
use XML schemas to specify the OLAC metadata
format.

oai:ldc:LDC94T5
  Date:               1994
  Title:              ECI Multilingual Text
  Type:               text
  Identifier:         1-58563-033-3
  Subject.language:   Albanian, Bulgarian, Chinese, Czech, Dutch,
                      English, Estonian, French, Gaelic, German, Greek,
                      Italian, Japanese, Latin, Lithuanian, Malay,
                      Spanish, Danish, Uzbek, Norwegian, Portuguese,
                      Russian, Serbian, Swedish, Turkish, Tibetan
  Identifier:         http://www.ldc.upenn.edu/Catalog/LDC94T5.html
  Description:        Recommended Applications: information retrieval,
                      machine translation, language modeling

oai:elra:L0030
  Title:              Bulgarian Morphological Dictionary
  Date:               1998
  Subject.language:   Bulgarian
  Description:        67,500 entries divided into 242 inflectional types
                      (including proper nouns), morphosyntactic
                      information for each entry, and a morphological
                      engine (MS DOS and WINDOWS 95/NT) for
                      morphological analysis and generation
  Identifier:         http://www.icp.inpg.fr/ELRA/cata/text_det.html#bulmodic

oai:dfki:KPML
  Title:              KPML
  Creator:            Bateman and many others
  Subject.language:   Spanish, Russian, Japanese, Greek, German, French,
                      English, Czech, Bulgarian
  Format.os:          Windows NT, Windows 98, Windows 95/98, Solaris
  Type.functionality: Software: Annotation Tools, Grammars, Lexica,
                      Development Tools, Formalisms, Theories, Deep
                      Generation, Morphological Generation, Shallow
                      Generation
  Description:        Natural Language Generation Linguistic Resource
                      Development and Maintenance workbench for large
                      scale generation grammar development, teaching,
                      and experimental generation. Based on
                      systemic-functional linguistics. Descendent of
                      the Penman NLG system.
  Identifier:         http://www.purl.org/net/kpml
  Description:        Contact: bateman@uni-bremen.de
  Relation.requires:  Windows: none; Solaris: CommonLisp + CLIM

Figure 4: Querying the Prototype Service Provider for Bulgarian Resources

XML schemas make it possible for element 
content and attribute values to be constrained 
according to the element name. However, XML 
schemas do not permit element content to be 
constrained on the basis of the attribute value. 
Accordingly, in implementing qualified Dublin 
Core using XML, we are limited to using one 
encoding scheme (or controlled vocabulary) per 
element. 

There are two cases we need to consider here. 
In the case where all refinements of an element 
employ the same encoding scheme, we use the 
element name as is and add a refine attribute with 
a fixed value. This documents that the particu- 
lar encoding scheme has been used, and ensures 
that the element cannot be confused with a cor- 
responding unqualified Dublin Core element (see 
the above example). In the case where differ- 
ent refinements of an element employ different 
encoding schemes, then a unique element must
be defined. Following (DCMI, 2000b), we define
such elements by concatenating the Dublin Core 
element name and the refinement name with an 
intervening dot. An example is shown below: 

<Format.encoding code="iso-8859-1"/>
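
The consequence of this restriction is that a validator can select the vocabulary from the element name alone. The Python sketch below mimics that lookup; it is not an XML schema, and the vocabulary contents are hypothetical stand-ins rather than the actual OLAC vocabularies.

```python
# Sketch: one controlled vocabulary per element name.
# Vocabulary contents are hypothetical placeholders for illustration.
VOCABULARIES = {
    "Format.encoding": {"iso-8859-1", "utf-8"},
    "Format.os": {"Solaris", "Windows NT"},
}


def code_is_valid(element_name, code):
    """Check a code value against the vocabulary tied to the element name."""
    vocabulary = VOCABULARIES.get(element_name)
    if vocabulary is None:
        # No controlled vocabulary is registered for this element.
        return True
    return code in vocabulary
```

Had both refinements shared the element name Format, the lookup key would be ambiguous, which is why each refinement with its own encoding scheme receives a distinct dotted name.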

3.2 Attributes used in implementing the 
OLAC Metadata Set 

Three attributes - refine, code, and lang - are 
used throughout the metadata set to handle most 
qualifications to Dublin Core. Some elements in 
the OLAC Metadata Set use the refine attribute 
to identify element refinements. These quali- 
fiers make the meaning of an element narrower 
or more specific. A refined element shares the 
meaning of the unqualified element, but with a
more restricted scope (DCMI, 2000a).

Some elements in the OLAC Metadata Set use 
the code attribute to hold metadata values that are 
taken from a specific encoding scheme. When an 
element may take this attribute, the attribute value 
specifies a precise value for the element taken 
from a controlled vocabulary or formal notation 
(§3.4). In such cases, the element content may
also be used to specify a freeform elaboration of
the coded value. 



Every element in the OLAC Metadata Set may 
use the lang attribute. It specifies the language 
in which the text in the content of the element is 
written. The value for the attribute comes from 
a controlled vocabulary OLAC-Language. By 
default, the lang attribute has the value "en", for 
English. Whenever the language of the element 
content is other than English, the lang attribute 
should be used to identify the language. By 
using multiple instances of the metadata elements 
tagged for different languages, data providers 
may offer their metadata records in multiple 
languages. 

In addition, there is a lang attribute on the 
<olac> element that contains the metadata 
elements for a given metadata record. It lists 
the languages in which the metadata record 
is designed to be read. This attribute holds 
a space-delimited list of language codes. By 
default, this attribute has the value "en", for 
English, indicating that the record is aimed only 
at English readers. If an explicit value is given for 
the attribute, then the record is aimed at readers 
of all the languages listed. 

Service providers should use this information 
in order to offer multilingual views of the meta- 
data. When a metadata record lists only one alter- 
native language, then all elements are displayed 
(regardless of their individual languages), unless 
the user has requested to suppress all records in 
that language. When a metadata record has mul- 
tiple alternative languages, the user should be able 
to select one and have display of elements in the 
other languages suppressed. An element in a 
language not included in the list of alternatives 
should always be displayed (for instance, the ver- 
nacular title of a work). 
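
These display rules can be summarised in a short Python sketch. The tuple representation of elements and the function name are assumptions made for illustration, as is the fallback to the first listed language when the reader's preferred language is not among the alternatives; the text does not specify that case.

```python
# Sketch of the multilingual display rules for one metadata record.
# Each element is a (name, lang, text) tuple; this layout is an assumption.
def visible_elements(record_langs, elements, preferred, suppressed_langs=()):
    """Return the elements a service provider would display for one reader."""
    if len(record_langs) == 1:
        # Single-language record: show everything, unless the reader has
        # asked to suppress records in that language altogether.
        if record_langs[0] in suppressed_langs:
            return []
        return list(elements)
    # Assumed fallback: use the first listed language if the preferred
    # language is not one of the record's alternatives.
    chosen = preferred if preferred in record_langs else record_langs[0]
    # Keep the chosen language, plus any element whose language is not one
    # of the alternatives (for instance, a vernacular title).
    return [e for e in elements if e[1] == chosen or e[1] not in record_langs]


elements = [
    ("Title", "en", "A grammar"),
    ("Title", "fr", "Une grammaire"),
    ("Title", "x-sil-BAN", "vernacular title"),
]
```

With alternatives "en fr" and a reader who selects French, the English title is suppressed while the French and vernacular titles remain.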

3.3 The elements of the OLAC Metadata Set 

In this section we present a synopsis of the 
elements of the OLAC metadata set. For each 
element, we provide a one sentence definition 
followed by a brief discussion, systematically 
borrowing and adapting the definitions provided 
by the Dublin Core Metadata Initiative (DCMI,
1999). Each element is optional and repeatable. 

Contributor: An entity responsible for making 
contributions to the content of the 
resource. Examples of a Contributor
include a person, an organization, or a
service. The refine attribute is optionally 
used to specify the role played by the named 
entity in the creation of the resource, using 
the controlled vocabulary OLAC-Role.

Coverage: The extent or scope of the content
of the resource. Coverage will typically 
include spatial location or temporal period. 
Where the geographical information is pre- 
dictable from the language identification, it 
is not necessary to specify geographic cov- 
erage. 

Creator: An entity primarily responsible for 
making the content of the resource. The
refine attribute is optionally used to specify
the role played by the named entity in the 
creation of the resource, using the controlled 
vocabulary OLAC-Role.

Date: A date associated with an event in the life
cycle of the resource. The refine attribute is 
optionally used to refine the meaning of the 
date using values from a controlled vocab- 
ulary (for instance, date of creation versus 
date of issue versus date of modification, and 
so on). The vocabulary for refinements to
Date is defined in (DCMI, 2000a).



Description: An account of the content of the 
resource. Description may include but is not 
limited to: an abstract, table of contents, ref- 
erence to a graphical representation of con- 
tent, or a free-text account of the content. 

Format: The physical or digital manifestation 
of the resource. Typically, Format may 
include the media-type or dimensions of the 
resource. Format may be used to determine 
the software, hardware or other equipment 
needed to use the resource. The code 
attribute identifies the format using the 
controlled vocabulary OLAC-Format. 

Format.cpu: The CPU required to use a soft- 
ware resource. The code attribute identi- 
fies the CPU using the controlled vocabulary 
OLAC-CPU. 

Format.encoding: An encoded character set 
used by a digital resource. For a digitally 
encoded text, Format.encoding names the 
encoded character set it uses. For a font, 
Format.encoding names an encoded character 
set that it is able to render. For a software
application, Format.encoding names an
encoded character set that it can read or 
write. The code attribute is used to identify 
the character set using the controlled 
vocabulary OLAC-Encoding. 

Format.markup: The OAI identifier for the 
definition of the markup format. 
Format.markup provides an OAI identifier 
for an XML DTD, schema or some other 
definition of the markup format. (This has 
the side-effect of ensuring that the format 
definition is archived somewhere). For a 
software resource, Format.markup names a 
markup scheme that it can read or write. 
The code attribute identifies the markup 
scheme using the controlled vocabulary 
OLAC-Markup. 

Format.os: The operating system required to 
use a software resource. The code attribute 
is used to identify the operating system 
using the controlled vocabulary OLAC-OS. 
Additional restrictions on operating system
version may be specified using the element
content.

Format.sourcecode: The programming
language(s) of software distributed in source
form. The code attribute identifies the 
language using the controlled vocabulary 
OLAC-Sourcecode. 

Identifier: An unambiguous reference to 
the resource within a given context. 
Recommended best practice is to identify 
the resource by means of a string or number 
conforming to a globally-known formal 
identification system (e.g. URIs, ISBNs). 
For non-digital archives, Identifier may use 
the existing scheme for locating a resource 
within the collection. 

Language: A language of the intellectual 
content of the resource. Language is 
used for a language the resource is in, as 
opposed to the language it describes (see 
Subject.language). It identifies a language 
that the creator of the resource assumes 
that its eventual user will understand. The 
code attribute is used to make a precise 
identification of the language using the 
controlled vocabulary OLAC-Language. 



Publisher: An entity responsible for making the 
resource available. Examples of a publisher 
include a person, an organization, or a ser- 
vice. 

Relation: A reference to a related resource. This 
element is used to document relationships 
between resources. The refine attribute is 
used to refine the nature of the relationship 
using values from a controlled vocabulary 
(for instance, is replaced by, requires, is 
part of, and so on). The vocabulary for 
refinements to Relation is defined in (DCMI,
2000a).

Rights: Information about rights held in and 
over the resource. Typically, a Rights ele- 
ment will contain a rights management state- 
ment for the resource, or reference a service 
providing such information. Rights informa- 
tion often encompasses intellectual property 
rights (IPR), copyright, and various property 
rights. The code attribute is used to make 
a summary statement about rights using the 
controlled vocabulary OLAC-Rights.

Rights.software: Information about rights held 
in and over a software resource. A rights 
statement pertaining to software, using the 
controlled vocabulary OLAC-Software- 
Rights. 

Source: A reference to a resource from which 
the present resource is derived. For
instance, it may be the bibliographic
information about a printed book of which 
this is the electronic encoding or from which 
the information was extracted. 

Subject: The topic of the content of the 
resource. Typically, a Subject will be 
expressed as keywords, key phrases or 
classification codes that describe a topic of 
the resource. Recommended best practice 
is to select a value from a controlled 
vocabulary or formal classification scheme. 

Subject.language: A language which the content 
of the resource describes or discusses. As 
with the Language element, a code attribute 
is used to identify the language precisely. 

Title: A name given to the resource. Typically, 
a title will be a name by which the resource 
is formally known. A translation of the title 
can be supplied in a second Title element.
The lang attribute is used to identify the
language of these elements.

Type: The nature or genre of the content of 
the resource. The code attribute is used to 
identify the type using the Dublin Core con- 
trolled vocabulary DC-Type. 

Type.data: The nature or genre of the content of 
the resource, from a linguistic standpoint. 
Type includes terms describing general cate- 
gories, functions, genres, or aggregation lev- 
els for content. The code attribute is used to 
identify the type using the controlled vocab- 
ulary OLAC-Data. 

Type.functionality: The functionality of a 
software resource. The code attribute 
is used to identify the type using the 
controlled vocabulary OLAC-Functionality.

Observe that some elements, such as Format, 
Format.encoding and Format.markup are applicable
to software as well as to data. Service providers 
can exploit this feature to match data with appro- 
priate software tools. 
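The matching step mentioned above could be sketched as follows. This is only an illustration under stated assumptions: records are reduced to plain dictionaries from element names to sets of codes, the tool names are hypothetical, and the codes text/xml and utf-8 are assumed for the example rather than taken from the OLAC vocabularies.

```python
# Elements the paper says apply to both data and software.
SHARED_ELEMENTS = ("Format", "Format.encoding", "Format.markup")


def compatible(data, tool):
    """Return True if the tool shares a code with the data record for
    every shared element that the data record actually declares."""
    for element in SHARED_ELEMENTS:
        wanted = data.get(element, set())
        if wanted and not (wanted & tool.get(element, set())):
            return False
    return True


def matching_tools(data, tools):
    """Names of the software records compatible with one data record."""
    return [name for name, tool in tools.items() if compatible(data, tool)]


# Hypothetical records: each maps an element name to its set of codes.
lexicon = {"Format": {"text/sf"}, "Format.encoding": {"iso-8859-1"}}
tools = {
    "sf-editor": {"Format": {"text/sf"}, "Format.encoding": {"iso-8859-1"}},
    "xml-viewer": {"Format": {"text/xml"}, "Format.encoding": {"utf-8"}},
}
print(matching_tools(lexicon, tools))
```

An element that the data record leaves undeclared places no constraint on the tool, so sparse records still return candidates.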

3.4 The controlled vocabularies 

Controlled vocabularies are enumerations of legal 
values for the code attribute. In some cases, more 
than one value applies, in which case the corre- 
sponding element must be repeated, once for each 
applicable value. In other cases, no value is
applicable and the corresponding element is simply
omitted. In yet other cases, the controlled vocab- 
ulary may fail to provide a suitable item, in which 
case a similar item can be optionally specified and 
a prose comment included in the element content. 
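The three cases can be illustrated with a short sketch. It assumes Python's standard xml.etree.ElementTree module, omits the namespace declarations for brevity, and uses a hypothetical helper name (add_coded) and a made-up prose comment.

```python
import xml.etree.ElementTree as ET


def add_coded(parent, name, codes, comment=None):
    """Append one `name` element per code. An empty list adds nothing,
    which corresponds to omitting the element."""
    for code in codes:
        element = ET.SubElement(parent, name, {"code": code})
        if comment:
            element.text = comment  # prose note when the code is only a near fit


record = ET.Element("olac")  # namespace declarations left out for brevity
# Two applicable values: the element is repeated.
add_coded(record, "Format.os", ["Unix/Linux", "MSWindows/winNT"])
# No applicable value: the element is omitted.
add_coded(record, "Type.functionality", [])
# Nearest available item, qualified by a (hypothetical) prose comment.
add_coded(record, "Rights", ["restricted"],
          comment="Free for research use; contact the archive otherwise.")
print(ET.tostring(record, encoding="unicode"))
```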

3.4.1 OLAC-Language 

Language identification is an important 
dimension of language resource classification. 
However, the character-string representation 
of language names is problematic for several 
reasons: different languages (in different parts 
of the world) may have the same name; the 
same language may have a different name in 
each country where it is spoken; within the 
same country, the preferred name for a language 
may change over time; in the early history of 
discovering new languages (before names were 
standardized), different people referred to the 
same language by different names; and for
languages having non-Roman orthographies,
the language name may have several possible 
romanizations. Together, these facts suggest that 
a standard based on names will not work. Instead, 
we need a standard based on unique identifiers 
that do not change, combined with accessible 
documentation that clarifies the particular speech 
variety denoted by each identifier. 

The information technology community has a 
standard for language identification, namely, ISO 
639 (ISO, 1998). Part 1 of this standard lists
two-letter codes for identifying 160 of the world's
major languages; part 2 of the standard lists three- 
letter codes for identifying about 400 languages. 
ISO 639 in turn forms the core of another stan- 
dard, RFC 3066 (formerly RFC 1766), which is 
the standard used for language identification in 
the xml:lang attribute of XML and in the language
element of the Dublin Core metadata set. RFC 
3066 provides a mechanism for users to register 
new language identification codes for languages 
not covered by ISO 639, but very few additional 
languages have been registered. 

Unfortunately, the existing standard falls 
far short of meeting the needs of the language 
resources community since it fails to account for 
more than 90% of the world's languages, and 
it fails to adequately document what languages 
the codes refer to (Simons, 2000). However,
SIL's Ethnologue (Grimes, 2000) provides a
complete system of language identifiers which 
is openly available on the Web. OLAC will 
employ the RFC 3066 extension mechanism 
to build additional language identifiers based 
on the Ethnologue codes. For the 130-plus 
ISO-639-1 codes having a one-to-one mapping
onto Ethnologue codes, OLAC will support both. 
Where an ISO code is ambiguous - such as mkh
for "other Mon Khmer languages" - OLAC will 
require the Ethnologue code. New identifiers 
for ancient languages, currently being developed 
by LINGUIST List, will be incorporated. These 
language identifiers are expressed using the code 
attribute of the Language and Subject.language 
elements. The free-text content of these elements 
may be used to specify an alternative human- 
readable name for the language (where the name 
specified by the standard is unacceptable for
some reason) or to specify a dialect (where the
resource is dialect-specific). 
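A check on the code attribute along these lines might look as follows. The sketch makes several assumptions that the text above does not fix: that Ethnologue-based identifiers are written as private-use RFC 3066 tags of the form x-sil-XXX, that Ethnologue codes are three uppercase letters, and that the two small sets below stand in for the real code tables. The function name is hypothetical.

```python
import re

# Assumption: Ethnologue-based identifiers take the private-use form
# "x-sil-XXX"; the exact syntax is not specified in the text.
ETHNOLOGUE_TAG = re.compile(r"x-sil-[A-Z]{3}")

# Illustrative subsets only, not the real tables.
ONE_TO_ONE_ISO = {"en", "fr", "de"}  # assumed to map onto one Ethnologue code
AMBIGUOUS_ISO = {"mkh"}              # collective code for other Mon-Khmer languages


def check_language_code(code):
    """Classify a code attribute value as ok, ambiguous or unknown."""
    if ETHNOLOGUE_TAG.fullmatch(code):
        return "ok"
    if code in ONE_TO_ONE_ISO:
        return "ok"
    if code in AMBIGUOUS_ISO:
        return "ambiguous: use the Ethnologue-based identifier"
    return "unknown"


for code in ("en", "x-sil-ABC", "mkh", "zz"):
    print(code, "->", check_language_code(code))
```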

3.4.2 OLAC-Data 

After language identification, another dimen- 
sion of central importance is the linguistic type of 
a resource. Notions such as "lexicon" and "gram- 
mar" are fundamental to OLAC, and the discourse 
of the language resources community depends on 
shared assumptions about what these types mean. 

We believe that it is helpful to distinguish at 
least four top-level types: transcription, annota- 
tion, description and lexicon, each defined broadly 
as proposed below. A transcription is any time- 
ordered symbolic representation of a linguistic 
event. An annotation is any kind of structured 
linguistic information that is explicitly aligned to 
some spatial and/or temporal extent of a linguistic 
record (such as a recorded signal or an image). 
A description is any description or analysis of a 
language; unlike a transcription or an annotation, 
the structure of a description is independent of the 
structure of the linguistic events that it describes. 
A lexicon is any record-structured inventory of lin- 
guistic forms. 

For each of these top-level types we envision a 
more specific vocabulary to facilitate greater pre- 
cision. For example, an orthographic transcrip- 
tion would have the code transcription/orthographic. 
Other subtypes could include: phonetic, prosodic, 
morphological, gestural, part-of-speech, syntactic, dis- 
course, musical. The annotation type would include 
these subtypes, and add others to cover spatial 
annotation of images (e.g. for OCR annotation 
of textual images or for isogloss maps). 

The description type could have subtypes for 
grammatical, phonological, orthographic, paradigms, 
pedagogical, dialectal and comparative. The lexi- 
con type could also carry subtypes to distinguish 
wordlists, wordnets, thesauri and so forth. 
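The proposed type/subtype codes could be validated with a sketch like the one below. The subtype names for transcription and description are taken from the text; the lexicon codes (wordlist, wordnet, thesaurus) are hypothetical spellings, and the additional spatial subtypes of annotation are left out because the text does not name them.

```python
LINGUISTIC = {
    "orthographic", "phonetic", "prosodic", "morphological", "gestural",
    "part-of-speech", "syntactic", "discourse", "musical",
}

SUBTYPES = {
    "transcription": LINGUISTIC,
    "annotation": LINGUISTIC,  # spatial subtypes are not enumerated in the text
    "description": {
        "grammatical", "phonological", "orthographic", "paradigms",
        "pedagogical", "dialectal", "comparative",
    },
    "lexicon": {"wordlist", "wordnet", "thesaurus"},  # hypothetical code names
}


def parse_data_code(code):
    """Split a code such as 'transcription/orthographic' into
    (top-level type, subtype); the subtype is None when absent."""
    top, _, sub = code.partition("/")
    if top not in SUBTYPES:
        raise ValueError(f"unknown top-level type: {top!r}")
    if sub and sub not in SUBTYPES[top]:
        raise ValueError(f"unknown subtype {sub!r} for type {top!r}")
    return top, (sub or None)


print(parse_data_code("transcription/orthographic"))
print(parse_data_code("lexicon"))
```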

3.4.3 Other controlled vocabularies 

OLAC-CPU: A vocabulary for identifying the 
CPU(s) for which the software is available, 
in the case of binary distributions: x86, mips, 
alpha, ppc, sparc, 680x0.

OLAC-Encoding: A vocabulary for identifying 
the character encoding used by a digital 
resource, e.g. iso-8859-1, ... 



<?xml version="1.0" encoding="UTF-8"?>
<olac
  xmlns="http://www.language-archives.org/OLAC/0.3/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.language-archives.org/OLAC/0.3/
    http://www.language-archives.org/OLAC/olac-0.3b...">
  <Title>KPML</Title>
  <Identifier>http://www.purl.org/net/kpml/</Identifier>
  <Creator refine="Author">Bateman, John</Creator>
  <Subject.language code="es"/> <Subject.language code="ru"/>
  <Subject.language code="ja"/> <Subject.language code="el"/>
  <Subject.language code="de"/> <Subject.language code="fr"/>
  <Subject.language code="en"/> <Subject.language code="cs"/>
  <Subject.language code="bg"/>
  <Format.os code="MSWindows/winNT"/> <Format.os code="MSWindows/win95"/>
  <Format.os code="MSWindows/win98"/> <Format.os code="Unix/Solaris"/>
  <Type.functionality>Annotation Tools, Grammars, Lexica, Development Tools,
    Formalisms, Theories, Deep Generation, Morphological Generation,
    Shallow Generation</Type.functionality>
  <Relation refine="Requires">Windows: none; Solaris: CommonLisp + CLIM</Relation>
  <Description>Natural Language Generation Linguistic Resource Development and
    Maintenance workbench for large scale generation grammar development,
    teaching, and experimental generation. Based on systemic-functional
    linguistics. Descendent of the Penman NLG system.</Description>
</olac>



Figure 5: OLAC Metadata Record for KPML 



OLAC-Format: A vocabulary for identifying 
the manifestation of the resource. The 
representation is inspired by MIME types, 
e.g. text/sf for SIL standard format. 
(Format.markup is used to identify the 
particular tagset.) It may be necessary 
to add new types and subtypes to cover 
non-digital holdings, such as manuscripts, 
microforms, and so forth and we expect to be 
able to incorporate an existing vocabulary. 

OLAC-Functionality: A vocabulary for clas- 
sifying the functionality of software, again 
using the MIME style of representation, 
and using the HLT Survey as a source of
categories (Cole, 1997) as advocated by
the ACL/DFKI Natural Language Software 
Registry. For example, written/OCR would 
cover "written language input, print or 
handwriting optical character recognition." 

OLAC-OS: A vocabulary for identifying the 
operating system(s) for which the software 
is available: Unix, MacOS, OS2, MSDOS, 
MSWindows. Each of these has optional 
subtypes, e.g. Unix/Linux, MSWindows/winNT. 

OLAC-Rights: A vocabulary for classifying 
the rights held over a resource, e.g.: open, 
restricted, ... 



OLAC-Role: A vocabulary for identifying 
the role of a contributor or creator of the 
resource, e.g.: author, editor, translator, 
transcriber, sponsor, ... 

OLAC-Software-Rights: A vocabulary for 
classifying the rights held over a resource, 
e.g.: open-source, royalty-free-library, royalty- 
free-binary, commercial, ... 

OLAC-Sourcecode: A vocabulary for identify- 
ing the programming language(s) used by 
software which is distributed in source form, 
e.g.: C++, Java, Python, Tcl, VB, ...

4 XML Representation 

The OLAC metadata format consists of an XML 
schema for the element set, and a set of schemas 
for the controlled vocabularies. The latest ver- 
sions are available from the OLAC website. 

Figure 5 shows the OLAC metadata record
corresponding to the KPML display shown in an earlier figure.
The top element is olac; this references the XML
namespace for version 0.3b1 of the schema. The
contents of the olac element are the OLAC meta- 
data elements, which are optional and repeatable, 
and can occur in any order, as in Dublin Core. 

Some elements employ the optional code or 
refine attributes, and/or free-text content. The
third attribute, lang, is not used here since the
free-text content is in English (specified in the XML
schema as the default). For the Creator element, 
the refine attribute narrows the meaning of cre- 
ator to Author. For the Subject.language elements, 
the code attribute specifies nine languages using 
Ethnologue codes. A service provider would map 
these codes to human-readable names. 
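A service provider's mapping step could be sketched as below, assuming Python's standard xml.etree.ElementTree module. The record is an abridged copy of the KPML record, and the name table is a small illustrative dictionary rather than an actual OLAC or Ethnologue resource; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

OLAC_NS = "http://www.language-archives.org/OLAC/0.3/"

# Abridged version of the KPML record.
RECORD = """<olac xmlns="http://www.language-archives.org/OLAC/0.3/">
  <Title>KPML</Title>
  <Creator refine="Author">Bateman, John</Creator>
  <Subject.language code="es"/>
  <Subject.language code="ru"/>
  <Subject.language code="ja"/>
</olac>"""

# Illustrative lookup table; a real service would load a full code list.
NAMES = {"es": "Spanish", "ru": "Russian", "ja": "Japanese"}


def language_names(record_xml, names):
    """Return display names for every Subject.language code in a record,
    falling back to the bare code when no name is known."""
    root = ET.fromstring(record_xml)
    tag = f"{{{OLAC_NS}}}Subject.language"  # namespace-qualified tag name
    codes = [element.get("code") for element in root.iter(tag)]
    return [names.get(code, code) for code in codes]


print(language_names(RECORD, NAMES))
```

Because the elements are repeatable and unordered, the sketch iterates over all matching elements rather than expecting a single one at a fixed position.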

The Format.os element illustrates a two-level 
coding scheme, consisting of an OS "family", 
followed by a specific operating system. Further 
details can be included in the free-text content 
if necessary. If a piece of software runs on all 
members of an OS family, then the more detailed 
designation can be omitted, e.g. code="Unix". 
The Type.functionality element is specified using 
free-text content, since the details of the 
controlled vocabulary OLAC-Functionality are
still being worked out. 
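The two-level Format.os scheme suggests a simple matching rule, sketched here under the assumption that the query names a specific operating system in family/system form. The function name is hypothetical.

```python
def runs_on(software_codes, target):
    """Return True if any Format.os code covers the target system.

    `target` is assumed to be a specific system such as 'Unix/Linux'.
    A family-only code such as 'Unix' is taken to cover every member
    of that family.
    """
    family = target.split("/")[0]
    return any(code == target or code == family for code in software_codes)


kpml_codes = ["MSWindows/winNT", "MSWindows/win95",
              "MSWindows/win98", "Unix/Solaris"]
print(runs_on(kpml_codes, "Unix/Solaris"))
print(runs_on(kpml_codes, "Unix/Linux"))
print(runs_on(["Unix"], "Unix/Linux"))
```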

5 Conclusions 

The OLAC Metadata Set and controlled vocabu- 
laries are works in progress, and are continuing to 
be revised with input from participating archives 
and members of the wider language resources 
community. We hope to have provided sufficient 
motivation and exemplification for our choices so 
that readers will easily be able to contribute to 
ongoing developments. 

Even once OLAC is completely in place, there 
will still be documentation tasks which the cre- 
ators of language resources will have to under- 
take, and new habits to acquire. It will always be 
necessary to identify and manually correct incon- 
sistent or erroneous metadata. The OLAC con- 
trolled vocabularies will need to be refined indef- 
initely in response to changes in the world around 
us. The creators of language resources will need 
to generate metadata with each new resource and 
place the resource in a suitable archive. The 
communities will need to adopt best practices for 
archival storage formats. 

Despite these intrinsic limitations, the OLAC 
Metadata Set and controlled vocabularies offer a 
template for resource description, providing two 
clear benefits over traditional full-text descrip- 
tion and retrieval. First, the template guides the 
resource creator in giving a complete description 
of the resource, in contrast to prose descriptions
which may omit important details. And second,
the template associates a resource with standard 
labels, such as creator and title, permitting users 
to do focussed searching. Resources and reposi- 
tories can proliferate, yet common metadata and 
vocabularies will support centralized services giv- 
ing users easy access to language resources. 